I posit there are two types of errors that arise from geocoding. One is where the API totally misinterprets the input and geocodes an address to a latitude and longitude that aren’t even in the city of Austin. I’ll call that “Wrong City” error. The other is when the police record contains an address that would be ambiguous even to human eyes, for instance a street name with no street or block number. The geocoder is still forced to commit to a precise location, and so is vulnerable to a possibly large discrepancy. I’ll call this “Ambiguous Address” error.
I’ve mostly stuck to the Google Maps API through the ggmap package, but I also uploaded a spreadsheet of 3,500 unique addresses to geocod.io to test their geocoding. The output was nicer than ggmap’s in that it includes a confidence score and a census tract. It cost me $2.12. Let’s peek at the data:
# helper function: render a styled HTML table (kbl() and kable_styling() come from kableExtra)
showTable <- . %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
geocodio <- vroom(here("data/geocodio_trial_run.csv.gz")) %>% clean_names()
geocodio %>%
  select(1:5, city, census_tract_code) %>%
  head() %>%
  showTable()
| address | latitude | longitude | accuracy_score | accuracy_type | city | census_tract_code |
|---|---|---|---|---|---|---|
| 700 BLOCK EAST 10TH STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 500 BLOCK EAST 6TH STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 6TH STREET / TRINITY STREET Austin, TX | 30.26707 | -97.73933 | 0.80 | intersection | Austin | 001100 |
| 1200 BLOCK NORTH I H 35 Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 1500 BLOCK I H 35 Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
| 600 BLOCK RED RIVER STREET Austin, TX | 30.32637 | -97.77126 | 0.33 | place | Austin | 000102 |
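Note the repeated low-confidence point: several distinct “BLOCK” addresses all collapsed to the same coordinates with an accuracy_score of 0.33. A quick way to surface suspect rows (the 0.5 cutoff here is my own arbitrary choice, just for illustration):

```r
# tally rows by geocod.io's own confidence and match type
geocodio %>%
  mutate(low_confidence = accuracy_score < 0.5) %>%
  count(low_confidence, accuracy_type, sort = TRUE)
```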
So a first sanity check for “Wrong City” error would be to see if there are any addresses not geocoded to the city of Austin:
geocodio %>% distinct(city)
## # A tibble: 25 x 1
## city
## <chr>
## 1 Austin
## 2 Round Rock
## 3 San Antonio
## 4 Jarrell
## 5 Marble Falls
## 6 Pflugerville
## 7 Smithville
## 8 San Marcos
## 9 Cedar Park
## 10 Georgetown
## # … with 15 more rows
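A count by city (rather than just the distinct values) shows how concentrated these misses are:

```r
# how many rows landed in each (possibly wrong) city?
geocodio %>%
  count(city, sort = TRUE)
```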
These are Texas cities, but they are definitely not suburbs of Austin. The austin_city_limits.shp file contains geometries for the city council districts in Austin. Let’s use a rough polygon of those districts to see how often geocod.io geocoded to a point outside of Austin:
# using the sf package here
austin_city_limits <-
  st_read(here("data/austin_city_limits.shp")) %>%
  st_union() %>%       # dissolve the district polygons into one geometry
  st_convex_hull()     # and wrap it in a single rough polygon
## Reading layer `austin_city_limits' from data source `/Users/nate/workspace/Austin Homelessness/data/austin_city_limits.shp' using driver `ESRI Shapefile'
## Simple feature collection with 199 features and 9 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -98.01504 ymin: 30.04298 xmax: -97.48034 ymax: 30.51685
## geographic CRS: WGS84(DD)
# create an "sf" object which is just a dataframe with geometries attached
geocodio_geometry <- geocodio %$% map2(longitude, latitude, ~ st_point(c(.x, .y)))
geocodio %<>% st_sf(geometry = st_sfc(geocodio_geometry, crs = st_crs(austin_city_limits)))
geocodio %<>% cbind(within_austin = st_intersects(geocodio, austin_city_limits, sparse = FALSE))
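For readers new to sf: with sparse = FALSE, st_intersects returns a plain logical matrix with one row per point, which is why it can simply be cbind-ed on as a within_austin column. A self-contained toy example:

```r
library(sf)

# a 2x2 square and two test points, one inside and one outside
square <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)
))))
pts <- st_sfc(st_point(c(1, 1)), st_point(c(3, 3)))

st_intersects(pts, square, sparse = FALSE)
#       [,1]
# [1,]  TRUE
# [2,] FALSE
```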
# how many points were not in Austin?
geocodio %>%
  count(within_austin) %>%
  as_tibble() %>%
  select(within_austin, n)
## # A tibble: 2 x 2
## within_austin n
## <lgl> <int>
## 1 FALSE 128
## 2 TRUE 3239
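As a share of the total (st_drop_geometry just keeps the summary from dragging the geometry column along):

```r
# proportion of geocod.io points outside the Austin polygon (about 0.04)
geocodio %>%
  st_drop_geometry() %>%
  summarise(share_outside = mean(!within_austin))
```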
That doesn’t look too bad… about 4% of the geocoded addresses fall outside the city of Austin. I’m comfortable dropping them, but let’s take a closer look first and see if we can learn anything about the addresses that end up geocoded outside the city.
# let's view a few of the points that are not in Austin
geocodio %>%
  dplyr::filter(!within_austin) %>%
  select(address, accuracy_score, city, county) %>%
  head(10) %>%
  showTable()
| address | accuracy_score | city | county | geometry |
|---|---|---|---|---|
| 110 NORTH IH 35 SERVICE ROAD Austin, TX | 0.90 | Round Rock | Williamson County | POINT (-97.6872 30.5083) |
| 8TH AT COLORADO Austin, TX | 1.00 | San Antonio | Bexar County | POINT (-98.48153 29.43139) |
| N. IH35/WF Austin, TX | 0.68 | Jarrell | Williamson County | POINT (-97.60652 30.8247) |
| N. IH35 S/B/E. CESAR CHAVEZ Austin, TX | 0.68 | San Antonio | Bexar County | POINT (-98.47967 29.4381) |
| E. 11TH STREET/WF/IH35 Austin, TX | 0.76 | Marble Falls | Burnet County | POINT (-98.26873 30.57753) |
| W. 3RD STREET/LAVACA STREET Austin, TX | 0.80 | Smithville | Bastrop County | POINT (-97.16934 30.01062) |
| NORTH IH35 AND EAST 12TH STREET Austin, TX | 0.15 | San Marcos | Hays County | POINT (-97.92301 29.88173) |
| 12TH STREET/IH35 SB Austin, TX | 0.82 | San Antonio | Bexar County | POINT (-98.48095 29.43759) |
| 6th At Neches Street Austin, TX | 1.00 | Georgetown | Williamson County | POINT (-97.68081 30.63813) |
| 12th Street And Ih35 South Bound Austin, TX | 0.82 | San Antonio | Bexar County | POINT (-98.48095 29.43759) |
As a native Austinite I can place each of these addresses within the city of Austin; non-natives may not realize just how far off these estimates are. Let’s plot the points to get a better sense:
# exclude anything from the plot that is obviously wrong (two points were located in Boston :( )
geocodio_minus_boston <- geocodio %>% dplyr::filter(longitude < -90)
stamen_map <- st_bbox(geocodio_minus_boston) %>%
  as.numeric() %>%
  get_stamenmap()
ggmap(stamen_map) +
  geom_sf(data = geocodio_minus_boston, inherit.aes = FALSE, color = "#F54D97")
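Before moving on, a rough pattern check is tempting: many of the missed addresses read like intersections (“/”, “AT”, “AND”) rather than numbered street addresses. This is only a crude regex heuristic, not a rigorous classification:

```r
# crude flag: does the address look like an intersection rather than a street number?
geocodio %>%
  st_drop_geometry() %>%
  mutate(looks_like_intersection = str_detect(str_to_upper(address), "/| AT | AND ")) %>%
  count(within_austin, looks_like_intersection)
```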
Let’s do exactly what we did with geocod.io and see how well Google Maps did. First, let’s peek at the shape of the data and then create sf POINT objects as before.
google_maps <- vroom(here("data/google_cache.csv.gz")) %>% clean_names()
google_maps %>%
  head(5) %>%
  showTable()
| google_search | lon | lat |
|---|---|---|
| 115 SANDRA MURAIDA WAY Austin, TX | -97.75483 | 30.26827 |
| 12600 BLOCK NORTH IH 35 Austin, TX | -97.71769 | 30.29847 |
| 800 BLOCK EMBASY DRIVE Austin, TX | -97.73227 | 30.26757 |
| 800 BLOCK ELM BERRY Austin, TX | -97.74475 | 30.26461 |
| WEST 5TH STREET and SOUTH MOPAC EXPRESSWAY Austin, TX | -97.76446 | 30.27393 |
# create an "sf" object which is just a dataframe with geometries attached
maps_geometry <- google_maps %$% map2(lon, lat, ~ st_point(c(.x, .y)))
google_maps %<>% st_sf(geometry = st_sfc(maps_geometry, crs = st_crs(austin_city_limits)))
google_maps %<>% cbind(within_austin = st_intersects(google_maps, austin_city_limits, sparse = FALSE))
# how many points were not in Austin?
google_maps %>%
  count(within_austin) %>%
  as_tibble() %>%
  select(within_austin, n)
## # A tibble: 2 x 2
## within_austin n
## <lgl> <int>
## 1 FALSE 116
## 2 TRUE 3576
Peeking at the data for “Wrong City” points here would not be very informative, since we only have lat and lon and no other identifying information. Let’s skip straight to the plot:
ggmap(stamen_map) +
  geom_sf(data = google_maps, inherit.aes = FALSE, color = "#F54D97")
Both APIs clearly committed errors. From the maps, though, it doesn’t look like they’re committing the “same type” of error: the errors cluster in different places (San Antonio for geocod.io, along I-35 for Google Maps). So perhaps we can conclude that, in any one instance, at least one of the two answers is “good enough”.
Let’s build a dataset that contains, for each address, the distance between the two APIs’ estimates. That way we can look at the addresses where geocod.io and Google Maps really disagree.
citations_by_tract <- vroom(here("data/citations_by_tract.csv.gz"))
comparison_dataset <- citations_by_tract %>%
  distinct(google_search) %>%
  inner_join(google_maps, by = c("google_search")) %>%
  inner_join(geocodio, by = c("google_search" = "address"))
# computing distance on the full geometry columns (rather than point by point)
# keeps the CRS attached, so st_distance returns a true geodesic distance that
# set_units() can convert from meters to miles
comparison_dataset %<>%
  mutate(distance_miles = st_distance(geometry.x, geometry.y, by_element = TRUE) %>%
    units::set_units(miles) %>%
    as.numeric()) %>%
  rename(
    within_austin_google = within_austin.x,
    within_austin_geocodio = within_austin.y
  )
# let's see if we can construct a table of accuracy by within_austin_XXX
comparison_dataset %>%
  group_by(within_austin_google, within_austin_geocodio) %>%
  summarise(
    count = n(),
    `Miles 25%` = quantile(distance_miles, 0.25),
    `Miles 50%` = median(distance_miles),
    `Miles 75%` = quantile(distance_miles, 0.75)
  ) %>%
  showTable()
| within_austin_google | within_austin_geocodio | count | Miles 25% | Miles 50% | Miles 75% |
|---|---|---|---|---|---|
| FALSE | FALSE | 7 | 0.2408126 | 0.7046287 | 1.3759725 |
| FALSE | TRUE | 84 | 0.5113389 | 0.6710727 | 1.1977254 |
| TRUE | FALSE | 102 | 0.4425330 | 0.5751898 | 0.6283034 |
| TRUE | TRUE | 2713 | 0.0001669 | 0.0155369 | 0.0675446 |
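One way to act on this cross-tab is a simple fallback rule: prefer Google’s point when it lands inside the city limits, fall back to geocod.io’s when it doesn’t, and drop the handful of rows where both miss. A sketch (column names follow comparison_dataset above; this is an illustration, not necessarily the final rule):

```r
# label each address by which geocoder, if either, we would trust
comparison_dataset %>%
  mutate(chosen_source = case_when(
    within_austin_google   ~ "google",
    within_austin_geocodio ~ "geocodio",
    TRUE                   ~ "neither (drop)"
  )) %>%
  count(chosen_source)
```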
ggplot(comparison_dataset) +
  geom_histogram(
    aes(distance_miles, fill = glue("{within_austin_geocodio}{within_austin_google}")),
    bins = 75
  ) +
  facet_wrap(~ within_austin_google + within_austin_geocodio, ncol = 1, scales = "free", labeller = "label_both") +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(
    title = "Distance in miles between Google and Geocod.io Estimate",
    subtitle = "Cross-tabulated by \"Wrong City\" errors*",
    caption = "*Note scale differences"
  )
The table suggests a heuristic approach. First, there is probably very little “Ambiguous Address” bias, as shown by the small median distance when both geocoders place a point inside Austin city limits.
I will therefore take being inside Austin city limits as a sufficient condition for accuracy. That means that, using the Google Maps API as the initial source of truth,